World Modeling with Probabilistic Structure Integration
Klemen Kotar, Wanhee Lee, Rahul Venkatesh, Honglin Chen, Daniel Bear, Jared Watrous, Simon Kim, Khai Loong Aw, Lilian Naing Chen, Stefan Stojanov, Kevin Feigelis, Imran Thobani, Alex Durango, Khaled Jedoui, Atlas Kazemian, Dan Yamins
We present Probabilistic Structure Integration (PSI), a system for learning richly controllable and flexibly promptable world models from data. PSI consists of a three-step cycle. The first step, Probabilistic prediction, involves building a probabilistic graphical model Psi of the data, in the form of a random-access autoregressive sequence model. Psi supports a complete set of learned conditional distributions describing the dependence of any variables in the data on any other set of variables. In step 2, Structure extraction, we show how to extract underlying low-dimensional properties in the data, corresponding to a diverse set of meaningful "intermediate structures", in a zero-shot fashion via causal inference on Psi. Step 3, Integration, completes the cycle by converting these structures into new token types that are then continually mixed back into the training diet as conditioning signals and prediction targets. Each such cycle augments the capabilities of Psi, both allowing it to model the underlying data better, and creating new control handles -- akin to an LLM-like universal prompting language. We train an instance of Psi on 1.4 trillion tokens of internet video data; we use it to perform a variety of useful video prediction and understanding inferences; we extract state-of-the-art optical flow, self-supervised depth and object segmentation; and we use these structures to support a full cycle of predictive improvements.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- South America > Venezuela > Monagas State > Maturin (0.04)
- North America > United States > California (0.04)
- (2 more...)
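The abstract's central claim is that Psi exposes "a complete set of learned conditional distributions" over the data, so any subset of variables can be predicted given any other subset. A toy illustration of that idea (not the paper's model, which is a large autoregressive sequence model): a small discrete joint distribution from which arbitrary conditionals can be read off.

```python
import itertools

# Toy stand-in for a learned joint distribution over three binary "token"
# variables; a real Psi learns these probabilities from 1.4T video tokens.
joint = {}
for bits in itertools.product([0, 1], repeat=3):
    # arbitrary unnormalized weights, for illustration only
    joint[bits] = 1 + bits[0] + 2 * bits[1] * bits[2]
total = sum(joint.values())
joint = {k: v / total for k, v in joint.items()}

def conditional(target_idx, target_val, evidence):
    """P(x[target_idx] = target_val | x[i] = v for each (i, v) in evidence)."""
    num = sum(p for x, p in joint.items()
              if x[target_idx] == target_val
              and all(x[i] == v for i, v in evidence.items()))
    den = sum(p for x, p in joint.items()
              if all(x[i] == v for i, v in evidence.items()))
    return num / den

# Any variable can be conditioned on any other subset -- the "random access"
# property that lets structures be extracted via causal probes on the model.
p = conditional(2, 1, {1: 1})
```

The point of the sketch is the interface, not the scale: once every conditional is queryable, interventions (fix some variables, read off others) become zero-shot operations on the model.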
Adapting Large Language Models via Reading Comprehension
Daixuan Cheng, Shaohan Huang, Furu Wei
We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge, but drastically hurts its prompting ability for question answering. Taking inspiration from human learning via reading comprehension--practice after reading improves the ability to answer questions based on the learned knowledge--we propose a simple method for transforming raw corpora into reading comprehension texts. Each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpora, consistently enhances performance across various tasks in three different domains: biomedicine, finance, and law. Notably, our 7B language model achieves competitive performance with domain-specific models of much larger scales, such as BloombergGPT-50B. Furthermore, we demonstrate that domain-specific reading comprehension texts can improve the model's performance even on general benchmarks, showing the potential to develop a general model across even more domains. Our model, code, and data will be available at https://github.com/microsoft/LMOps.
- North America > United States > New York > New York County > New York City (0.04)
- South America > Venezuela > Monagas State > Maturin (0.04)
- North America > United States > Illinois (0.04)
- Law (1.00)
- Education > Assessment & Standards > Student Performance (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.69)
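The transformation the abstract describes, enriching each raw text with comprehension tasks derived from its own content, can be sketched with a naive heuristic. This is a hypothetical illustration, not the authors' pipeline (which uses richer task templates): here each sentence simply yields one cloze question.

```python
import re

def to_reading_comprehension(raw_text):
    """Append naive cloze-style tasks derived from the text's own sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", raw_text)
                 if s.strip()]
    tasks = []
    for sent in sentences:
        words = sent.split()
        if len(words) < 5:          # skip fragments too short to question
            continue
        blank_idx = len(words) // 2  # blank out a mid-sentence word
        answer = words[blank_idx]
        question = " ".join(words[:blank_idx] + ["____"] + words[blank_idx + 1:])
        tasks.append((question, answer))
    return raw_text + "\n\n" + "\n".join(
        f"Q: {q}\nA: {a}" for q, a in tasks)

doc = to_reading_comprehension(
    "Continued pre-training adds domain knowledge. "
    "Prompting ability can degrade without task practice.")
```

The key design point survives even in this toy form: the tasks are generated from the corpus itself, so the method scales to any pre-training corpus without human annotation.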
HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language
Shantipriya Parida, Idris Abdulmumin, Shamsuddeen Hassan Muhammad, Aneesh Bose, Guneet Singh Kohli, Ibrahim Said Ahmad, Ketan Kotwal, Sayan Deb Sarkar, Ondřej Bojar, Habeebah Adamu Kakudi
This paper presents HaVQA, the first multimodal dataset for visual question-answering (VQA) tasks in the Hausa language. The dataset was created by manually translating 6,022 English question-answer pairs, which are associated with 1,555 unique images from the Visual Genome dataset. As a result, the dataset provides 12,044 gold standard English-Hausa parallel sentences that were translated in a fashion that guarantees their semantic match with the corresponding visual information. We conducted several baseline experiments on the dataset, including visual question answering, visual question elicitation, and text-only and multimodal machine translation.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Africa > Nigeria > Jigawa State > Dutse (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (32 more...)
- Health & Medicine (0.93)
- Information Technology (0.68)
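The dataset counts in the abstract are internally consistent: each of the 6,022 translated question-answer pairs contributes two parallel sentences (the question and the answer), giving the stated 12,044 gold-standard English-Hausa sentences.

```python
# Sanity check of the counts reported in the HaVQA abstract.
qa_pairs = 6022            # manually translated English QA pairs
sentences_per_pair = 2     # one question + one answer
parallel_sentences = qa_pairs * sentences_per_pair
```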
The Venezuelans Trying to Escape Their Country Through Video Game Grunt Work
On a recent afternoon in Maracaibo, Venezuela, Alexander Marinez, who has short-cropped black hair and three-to-four-day stubble, sat in front of his computer tracking herbiboars in the mushroom forests on Fossil Island. He pressed down on his glowing mouse, the newest addition to his otherwise timeworn gaming setup. The pixelated character on his computer screen followed the tracks of a hedgehoglike creature with triangular tusks and herbs growing out of its back. Outside Marinez's one-story house, the sun bore down on the dirt road. His home lies about six miles away from the strait that connects the Caribbean Sea with Lake Maracaibo, one of the world's richest sources of oil. The character inspected a tunnel. Suddenly, the herbiboar appeared, and the character attacked, stunning it.
- South America > Venezuela > Zulia State > Maracaibo (0.46)
- Atlantic Ocean > Caribbean Sea (0.25)
- South America > Venezuela > Lake Maracaibo (0.24)
- (13 more...)
- Leisure & Entertainment > Games > Computer Games (1.00)
- Government (1.00)
- Banking & Finance (1.00)
- Information Technology > Communications (0.95)
- Information Technology > Artificial Intelligence > Games (0.51)
Pull out all the stops: Textual analysis via punctuation sequences
Alexandra N. M. Darmon, Marya Bazzi, Sam D. Howison, Mason A. Porter
Whether enjoying the lucid prose of a favorite author or slogging through some other writer's cumbersome, heavy-set prattle (full of parentheses, em dashes, compound adjectives, and Oxford commas), readers will notice stylistic signatures not only in word choice and grammar, but also in punctuation itself. Indeed, visual sequences of punctuation from different authors produce marvelously different (and visually striking) sequences. Punctuation is a largely overlooked stylistic feature in "stylometry", the quantitative analysis of written text. In this paper, we examine punctuation sequences in a corpus of literary documents and ask the following questions: Are the properties of such sequences a distinctive feature of different authors? Is it possible to distinguish literary genres based on their punctuation sequences? Do the punctuation styles of authors evolve over time? Are we on to something interesting in trying to do stylometry without words, or are we full of sound and fury (signifying nothing)?
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.27)
- North America > United States > New York > New York County > New York City (0.14)
- (16 more...)
- Government > Regional Government (0.45)
- Media > Music (0.45)
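The object the paper studies, a text's punctuation sequence with all words discarded, is simple to extract. A minimal sketch (the paper's own feature set is richer, covering sequence statistics across a literary corpus):

```python
import string

def punctuation_sequence(text):
    """Return the ordered sequence of punctuation marks in a text."""
    return [ch for ch in text if ch in string.punctuation]

seq = punctuation_sequence(
    "Are we on to something interesting, "
    "or full of sound and fury (signifying nothing)?")
```

Comparing the resulting sequences across authors or genres, e.g. via n-gram frequencies over marks, is the kind of word-free stylometry the abstract asks about.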
1000 novels everyone must read: Science Fiction & Fantasy (part two)
When Haldeman returned from Vietnam, with a Purple Heart for the wounds he had suffered, he wrote a story about a pointless conflict that seems as if it will never end. It was set in space, and the enemies were aliens, but 18 publishers decided it was too close to home before St Martin's Press took a gamble. The book that "nobody wants to read" went on to win many prizes. It's not perfect - it's hard to take seriously a future in which heterosexuality is a perversion - but the anti-war message is as powerful as ever. Known for his intricate short stories and critically acclaimed mountaineering novel Climbers, Harrison cut his teeth on SF. In typical fashion, he writes space opera better than many who write only in the genre. For all its star travel and alien artefacts, scuzzy 25th-century spaceports and drop-out space pilots, Light is actually about twisting three plotlines as near as possible to snapping point. This is as close as SF gets to literary fiction, and literary fiction gets to SF. Jon Courtenay Grimwood Buy this book at the Guardian bookshop Amateur stonemason, waterbed designer, reformed socialist, nudist, militarist and McCarthyite, Heinlein is one of the most interesting and irritating figures in American science fiction.
- Media (1.00)
- Leisure & Entertainment (1.00)
- Government (1.00)